Abstract

According to the 1.5 views theorem (Poggio, 1990; Ullman and Basri, 1991), recognition
of a specific 3D object (defined in terms of pointwise features) from a novel 2D
view can be achieved from at least two 2D model views (in the database, for each
object, for orthographic projection). In this note we discuss how recognition can be
achieved from a single 2D model view. The basic idea is to exploit transformations 
that are specific for the object class corresponding to the object - and that may be 
known a priori or may be learned from views of other “prototypical” objects of the 
same class - to generate new model views from the only one available. The paper 
is organized in two distinct parts. In the first part, we discuss how to exploit prior 
knowledge of an object’s symmetry. We prove that for any bilaterally symmetric 3D 
object one non-accidental 2D model view is sufficient for recognition. We also prove 
that for bilaterally symmetric objects the correspondence of four points between two 
views determines the correspondence of all other points. Symmetries of higher order 
allow the recovery of structure from one 2D view. In the second part of the paper, 
we study a very simple type of object classes that we call linear object classes. Linear
transformations can be learned exactly from a small set of examples in the case of linear
object classes and used to produce new views of an object from a single view. We
also provide natural examples of linear object classes induced by symmetry properties 
of the objects. 
1 Introduction


Techniques have been recently developed that can learn to recognize a specific 3D object 
after a “learning” stage in which a few 2D views of the object are used as training examples 
(Poggio and Edelman, 1990; Edelman and Poggio, 1990). A lower bound on the number 
of views is provided by the 1.5 views theorem (Poggio, 1990; see also Ullman and Basri,
1991, who pioneered the linear combination approach, and Huang and Lee, 1989), which implies
that 2 views - appropriately defined - may be sufficient in the orthographic case. Under
more general conditions (perspective projection, more general definition of view, nonuniform
transformations etc.) many more views may be required (Poggio and Edelman’s estimate is
on the order of 100 for the whole viewing sphere using their approximation network).

Though this is an easily satisfied requirement in many cases, there are situations in which 
only one 2D view is available as a model. As an example, consider the problem of recognizing 
a face from just one view: humans can do it, even for different facial expressions (of course
an almost frontal view may not be sufficient for recognizing a profile view, and in fact the
practice of person identification usually requires a frontal and a side view).

Clearly one single view of a generic 3D object (if shading is neglected) does not contain 
sufficient 3D information. If, however, the object belongs to a class of similar objects
(prototypes), it seems possible to infer appropriate transformations for the class and use them to
generate other views of the specific object from just one 2D view of it. We are certainly able 
to recognize faces which are slightly rotated from just one quasi-frontal view, presumably 
because we exploit our extensive knowledge of the typical 3D structure of faces. 

One can pose the following problem: is it possible from one 2D view of a 3D object to 
generate other views, exploiting knowledge of the legal transformations associated with objects 
of the same class? A positive answer would imply (for orthographic projection and uniform 
affine transformations) that a novel 2D view may be recognized from a single 2D model view, 
because of the 1.5 views theorem (see footnote 1).

This note is divided into two distinct parts. In the first part we consider the case in which
legal transformations for a specific object (i.e. transformations that generate new correct 
views from a given one) are immediately available as a property of the class. In particular, 
we will discuss certain symmetry properties. In the second part, we consider the problem of 
learning appropriate transformations from examples of other objects of the same class. 

The main results in the first part of the paper are: 

1. we prove that for any bilaterally symmetric 3D object (such as a face) one 2D model 
view is sufficient for recognition of a novel 2D view (for orthographic projection and 
uniform affine transformations). This result is equivalent to the following statement: 
for bilaterally symmetric objects a model based recognition invariant (as defined by 
Weinshall, 1992) can be learned from just one model 2D view; 

2. we also prove that for symmetries of higher order (such as two-fold symmetries, i.e. 
bilateral symmetry with respect to two symmetry planes) it is possible to recover 
structure from one 2D view. 

Footnote 1: A positive answer would also make possible the use of other recognition techniques
such as Poggio and Edelman’s technique, and its extensions, possibly including the
correlation-based version (Brunelli and Poggio, 1991), by using the newly generated views as
a training set.





In the second part of the paper we first argue that transformations that generate additional
model views from a single view may be learned, at least approximately, from examples
of objects of the same class. We then

1. introduce the definition of “linear classes”, 

2. show that for linear classes one 2D model view is sufficient to generate exact additional 
views (and therefore to perform recognition of a novel view); 

3. discuss examples of linear classes and prove that object symmetries induce a natural 
set of linear classes: for instance, bilaterally symmetric objects are a linear class. 

In the final section, we briefly mention some of the implications of our results for the practical
recognition of bilaterally symmetric objects such as faces, for human perception of 3D structure
from single views of geometric objects and, more generally, for the role of symmetry
detection in human vision.


2 PART I: Object Symmetries Recover Recognition 
and Structure from One 2D View 

2.1 Recognition from One 2D Model View 

Suppose that we have a model 2D view of an object. Assume further that (a) we know a 
priori that the object is bilaterally symmetric (for instance because we identify the class to
which it belongs and we know that this class has the property of bilateral symmetry) and
(b) we know a pair of symmetric points in the 2D view. For the purpose of this first part 
we define an object to be bilaterally symmetric if the following transformation of any 2D 
view of a pair of symmetric points of the object yields a legal view of the pair, that is the 
orthographic projection of a rigid rotation of the object 
Notice that symmetric pairs are the elementary features in this situation and points lying on
the symmetry plane are degenerate cases of symmetric pairs. Notice also that our definition
of symmetry is in the same spirit as its use in physics, where symmetries of an abstract
object are typically defined in terms of properties of the object under an appropriate set of
transformations.

Geometrically, this simply means that for bilaterally symmetric objects simple transformations
of a 2D view yield other views that are legal. The transformations are similar to
mirroring one view around an axis in the image plane, as shown in Figure 1 top (where
the left image is “mirrored” into the right one) and correspond - but only for a bilaterally
symmetric object - to proper rotations of a rigid 3D object and their orthographic projection
on the image plane.

Equation 1 defines one such transformation and generates an additional view from the
one model view (and the knowledge of bilateral symmetry). The 2 views $x_{pair}$ and $x^*_{pair}$ are
linearly independent, unless $x_{pair} = \lambda x^*_{pair}$, which is equivalent to the condition that $x_{pair}$ is
the solution of the eigenvalue problem

$$D x_{pair} = \lambda x_{pair},$$

that is, unless $x_{pair}$ is a view which is left invariant (up to a sign) by the symmetry
operation $D$. The eigenvalue problem has exactly two solutions (with $\lambda = \pm 1$) which correspond
to “accidental” views such as a perfectly frontal view, an exact side view, and reflection about
the image plane (obtainable for “transparent” objects by a $\pi$ rotation). These $x$ are the only
ones for which the symmetry operation $D$ does not provide a linearly independent new view.
The same argument can be repeated for all symmetric pairs (points on the symmetry axes
are of course a degenerate case of a pair) and all transformations.

Thus, bilateral symmetry allows the generation of an additional, linearly independent
view of the object. The 1.5 views theorem (see Appendix A.5) can then be used to compute
the 3D basis that spans the spaces $V_x$ and $V_y$ of the object. Recognition of any view of the
object is then possible. We have thus proved

Theorem 2.1 A single 2D view of a bilaterally symmetric object (containing at least 2 symmetric,
nondegenerate pairs, once translations are factored out) yields a three dimensional
basis for the vector spaces $V_x$ and $V_y$ provided that the view is not an “accidental” view, i.e.
is not a solution of $Dx = \pm x$.

Notice that bilateral symmetry provides from one 2D view a total of eight 2D views, each
corresponding to a different rotation of the original 3D view. Four of the eight views are
linearly independent (two linearly independent vectors of the x coordinates and two for the
y coordinates; see footnote 2). Moses and Ullman (1991) derived a result about recognition
functions of symmetric objects that is consistent with our theorem and complements it.

Notice that it is also possible to define bilateral symmetry for the 3D object and then
show that this definition yields the one above in the following way. Let us call a 3D object
bilaterally symmetric if there exist a position and orientation of the object relative to a given
3D cartesian coordinate system for which each feature point $\mathbf{x}_1 = (x, y, z)$
has either $x = 0$ or a symmetric point $\mathbf{x}_2$, such that

$$\mathbf{x}_2 = (-x, y, z).$$

It is easy to verify that there is a rotation $\psi$ around the z axis in 3D under which the
vector of the coordinates of the two symmetric points transforms into $x_{pair}$, whereas a
rotation of $-\psi$ maps it into $x^*_{pair}$.

Footnote 2: The depth ambiguity of any 2D view of “transparent” objects corresponds to a
rotation of a single rigid object (notice that in the non-symmetric case the two views cannot be
interpreted as a rotation of a rigid object).


2.1.1 A Recognition Algorithm
2.1.1 A Recognition Algorithm 

A single 2D model view together with knowledge that the object is bilaterally symmetric
can be used for recognition (in the same spirit as Ullman and Basri, 1991) in the following
way.

1. Take $x_1$ and $y_1$ (the vectors of the x and y coordinates of the n feature points) from
the available view and generate a third vector $x_2$ (or $y_2$) by applying the symmetry
transformation $D$ to $x_1$ (or $y_1$).

2. Make a $2n \times 6$ matrix $B$ with its 6 columns representing a basis for $V_x \oplus V_y$.

An explicit form of $B$ is

$$B = \begin{pmatrix} x_1 & x_2 & y_1 & 0 & 0 & 0 \\ 0 & 0 & 0 & x_1 & x_2 & y_1 \end{pmatrix}$$

3. Check that $B$ is full rank (for instance, that $(B^T B)^{-1}$ exists). If $B$ is not full rank, try others
of the legal views induced by symmetry.

4. A novel view $t$ (we assume here that the first n components are the x coordinates
followed by the n y coordinates) of the same object must be in the space spanned by the
columns of $B$, and therefore must satisfy

$$t = B\alpha$$

which implies (since $(B^T B)^{-1}$ exists)

$$t = B (B^T B)^{-1} B^T t \qquad (2)$$

$B$ can then be used to check whether $t$ is a view of the correct object or not, by
checking whether $\|t - B (B^T B)^{-1} B^T t\| = 0$ or not (a further test for rigidity may also
be applied, if desired, to the three available views). Figure 1 shows the results of using
this technique to recognize simple pipe-cleaner animals.

Figure 1: Given a single 2D view (upper left), a new view (upper right) is generated under
the assumption of bilateral symmetry. The two views are sufficient to verify that a novel
view (second row) corresponds to the same object as the first.
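The four steps of the recognition algorithm can be sketched numerically as follows. This is only an illustrative sketch under stated assumptions, not the implementation used for Figure 1: the symmetry transformation is assumed to take the concrete form "swap the two labels within each symmetric pair and negate the x coordinates", the object is a random point set symmetric about the plane x = 0, and all function names (`symmetric_object`, `mirror_x`, `basis_matrix`, `residual`) are hypothetical.

```python
import numpy as np

def symmetric_object(n_pairs, rng):
    """Random 3D points symmetric about the plane x = 0 (rows 2k, 2k+1 form pair k)."""
    half = rng.normal(size=(n_pairs, 3))
    mirrored = half * np.array([-1.0, 1.0, 1.0])
    return np.stack([half, mirrored], axis=1).reshape(-1, 3)

def rotation_y(theta):
    """Rotation by theta (radians) about the vertical y axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def random_rotation(rng):
    """Random proper rotation obtained from a QR decomposition."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]
    return q

def project(points, rot):
    """Orthographic projection of the rotated points: returns the x and y vectors."""
    p = points @ rot.T
    return p[:, 0], p[:, 1]

def mirror_x(x):
    """Assumed symmetry transformation: swap labels within each pair, negate x."""
    swap = np.arange(len(x)).reshape(-1, 2)[:, ::-1].ravel()
    return -x[swap]

def basis_matrix(x1, y1):
    """Steps 1 and 2: build the 2n x 6 matrix B from one model view."""
    block = np.column_stack([x1, mirror_x(x1), y1])
    zeros = np.zeros_like(block)
    return np.block([[block, zeros], [zeros, block]])

def residual(B, t):
    """Step 4: norm of t - B (B^T B)^{-1} B^T t."""
    return np.linalg.norm(t - B @ np.linalg.solve(B.T @ B, B.T @ t))

rng = np.random.default_rng(0)
obj = symmetric_object(4, rng)
x1, y1 = project(obj, rotation_y(np.deg2rad(20)))   # non-accidental model view
B = basis_matrix(x1, y1)
rank_B = np.linalg.matrix_rank(B)                   # step 3: full rank check

xt, yt = project(obj, random_rotation(rng))         # novel view, same object
res_same = residual(B, np.concatenate([xt, yt]))

other = symmetric_object(4, rng)                    # a different symmetric object
xo, yo = project(other, random_rotation(rng))
res_other = residual(B, np.concatenate([xo, yo]))

xf, yf = project(obj, np.eye(3))                    # exactly frontal (accidental) view
rank_frontal = np.linalg.matrix_rank(basis_matrix(xf, yf))
```

Under these assumptions the residual is numerically zero for a novel view of the same object and clearly nonzero for a different object, while the exactly frontal model view yields a rank-deficient B, in line with Theorem 2.1.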


2.2 Structure from One 2D Model View 

Suppose, as before, that we have a single 2D view of an object. Assume further that we
hypothesize (correctly) that the object is twice bilaterally symmetric (we assume in the
present notation that x, y are the image coordinates and z is orthogonal to the image plane)
and that symmetric quadrupoles can be identified, that is sets of four points (they are
the “elementary” features in this situation, since any point which is not on both symmetry
planes corresponds to 3 other points). We define an object to be twice bilaterally symmetric
if the following transformations of any 2D view of a feature quadrupole yield legal views of
the quadrupole, that is orthographic projections of rigid rotations of the object:
Figure 2: A single 2D view (upper left) of a twice bilaterally symmetric object can generate
additional views (upper center and right) using the symmetric properties of the object. Those 
three views are sufficient to compute 3D structure, as indicated in the second row where we 
project the 3D structure computed from the 3 views above. 
These 3 views are independent apart from special views, such as accidental views (see previous
section). Thus the above definition of symmetry provides a way to generate two
additional views from the given one view, unless $x_{quadr}$ is a view which is left invariant by
at least one of the symmetry transformations $D_i$. This is the case, for instance, for exactly
frontal views. The same argument can be repeated for all symmetric quadrupoles.

Figure 3: A single 2D view (upper row) of a bilaterally symmetric object can be generated by
different bilaterally symmetric 3D objects. The three objects projected in the second row all
generate the 2D view of the first row after a rotation of 20° around the vertical axis.

Thus, these transformations yield in the generic case 3 independent views of the object
(the symmetry yields a total of 16 views, representing 16 different orientations of the object,
which span the 6 dimensional viewing space of the object). One can verify that standard
structure-from-motion techniques (Huang and Lee, 1989; see also Ullman, 1979) can be
applied to conclude that structure is uniquely determined up to a reflection about the image
plane (see footnote 3). The following holds:

Theorem 2.2 Given a single 2D orthographic view of a twice bilaterally symmetric object
(with at least 2 symmetric, nondegenerate quadrupole features containing a total of at least
four non-coplanar points) the corresponding structure is uniquely determined up to a reflection
about the image plane.

In addition, the following results can be easily derived: 

1. 3D structure can be obtained from two 2D views of a bilaterally symmetric object.

2. Structure cannot be uniquely obtained from a single 2D view of a bilaterally symmetric
object. So a single 2D view of a bilaterally symmetric object can be generated by
different bilaterally symmetric objects (see for example Figure 3).

Footnote 3: The W matrix defined by Weinshall (1992) is full rank in this case. It is rank
deficient for simple bilateral symmetry.

2.3 Correspondence and Bilateral Symmetry

Let us suppose that the correspondence of 4 non-coplanar points (or more) between two
views (the model view and the novel view) is given (as in A.6) and the object belongs to the
class of bilaterally symmetric objects. Then the argument of Appendix A.6 can be applied to
each of the two views generated by the model view and the assumption of bilateral symmetry
(see equation 1). For each point in the first view the corresponding point $(x, y)$ in the second
view then satisfies the two equations:

$$y = m x + A$$
$$y = m' x + A'$$

and is therefore uniquely determined (apart from special cases) as

$$y = \frac{m'A - mA'}{m' - m}, \qquad x = \frac{A' - A}{m - m'}.$$

Thus the correspondence of 4 non-coplanar points between two 2D views of a bilaterally
symmetric object (undergoing a uniform affine transformation) uniquely determines
correspondence of all other points.

Figure 4 shows an example of obtaining full correspondence between a model view and
a novel view given just four matched points and bilateral symmetry.

Figure 4: Given a single 2D view (upper left), a second view (upper right) is generated
under the assumption of bilateral symmetry. Four corresponding points (lower left)
are sufficient to obtain full correspondence between the model view (top left) and the novel
view (lower right), of the same 3D object undergoing a uniform affine transformation.
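The closed-form solution for the corresponding point is simply the intersection of the two constraint lines. A minimal sketch, assuming the slopes and intercepts of the two lines have already been obtained (the function name `intersect` and its argument names are hypothetical):

```python
def intersect(m, a, m_prime, a_prime):
    """Intersection of the lines y = m*x + a and y = m_prime*x + a_prime."""
    if m == m_prime:
        # Special case: parallel lines, the point is not uniquely determined.
        raise ValueError("parallel constraint lines")
    x = (a_prime - a) / (m - m_prime)
    y = (m_prime * a - m * a_prime) / (m_prime - m)
    return x, y

# Example: y = 2x + 1 and y = -x + 4 meet at (1, 3).
point = intersect(2.0, 1.0, -1.0, 4.0)
```

The special cases mentioned in the text correspond to the two lines being parallel (or coincident), which the sketch reports as an error rather than guessing a point.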

3 PART II: Learning Transformations 

The key idea that motivated the work described in this paper is to use appropriate
transformations to generate new views from a single 2D view. Such transformations may be known
a priori as a property of the object. This is the case discussed in Part I of this paper where
the symmetry properties of the objects provide the transformations. They can also be
synthesized in various ways. Poggio (1991) describes a few simple techniques such as the use
of 3D models whose parameters are estimated from the single 2D view. The 3D model can
then be transformed (for instance rotated) and new views thereby produced. This technique
has already been used for image compression (see Aizawa, Harashima and Saito, 1989).

We are interested in a different approach. The general idea is to use an approximation
technique, such as HyperBF networks, to learn an appropriate specific transformation from
a set of examples of objects of the same class. For instance, we may learn a specific
transformation that changes expression (from serious to smiling, say) of a face, using a set of
examples consisting of pairs of 2D views of faces (each pair consists of two views of the same
face, once serious and once smiling). In this section, we consider a more restricted set of
transformations, uniform affine transformations of 2D views of objects (see Appendix for
definitions), such as rotations, in order to begin to characterize their learnability. In the
case of faces one such transformation would be a specific rotation, for instance, from +30°
to 0°. It is worth emphasizing that the transformations we consider here are very specific
(from a certain specific pose to another). The situation is quite different from Part I, where
we were not interested in learning transformations and we were not restricted to specific
transformations.

We first introduce a very specific definition of object classes that we call linear object
classes, for which it is easy to show existence and learnability of exact transformations. We
do not believe that this is the best or most powerful definition of object classes. Its main 
merit is that it is simple and easy to analyse. We believe that other definitions should also be 
studied and that their computational and psychophysical relevance should be characterized. 


3.1 Linear Object Classes 

Consider a 3D view $X_0$ of object $O$. Assume that $X_0 \in \Re^{3n}$ is the linear combination of
the frontal 3D views of $q$ other objects of the same dimensionality, that is

$$X_0 = \sum_{i=1}^{q} \alpha_i X_i \qquad (5)$$

$X_0$ is then the weighted average of $q$ points in a $3n$ dimensional space. Consider now the
operator $L^r$ associated with a desired uniform transformation (see Appendix) such as for
instance a specific rotation in 3D. Let us define $X_i^r = L^r X_i$ the rotated 3D view of object $i$.
Because of linearity of the group of uniform linear transformations $\mathcal{L}$, it follows that

$$X_0^r = \sum_{i=1}^{q} \alpha_i X_i^r.$$

Thus, if a 3D view of an object can be represented as the weighted sum of views of other
objects, its rotated view is a linear combination of the rotated views of the other objects with
the same weights. The same statement also holds for the corresponding 2D views, obtained
from the 3D views under orthographic projection (see Appendix), that is

$$x_0 = \sum_{i=1}^{q} \alpha_i x_i \qquad (6)$$

implies

$$x_0^r = \sum_{i=1}^{q} \alpha_i x_i^r$$

with $x_0 = P X_0$, $x_0^r = P X_0^r$, $x_i = P X_i$ and $x_i^r = P X_i^r$.
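The claim that the weights survive both the 3D transformation and the orthographic projection can be checked numerically. The following is a small sketch under assumptions: prototypes are random point sets, the transformation is a 30° rotation about the vertical axis, and the helper names (`rot_y`, `view2d`) are hypothetical.

```python
import numpy as np

def rot_y(deg):
    """Rotation by `deg` degrees about the vertical y axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def view2d(X, R):
    """Rotate the n x 3 point set X by R, then project orthographically (drop z)."""
    return (X @ R.T)[:, :2].ravel()

rng = np.random.default_rng(0)
n, q = 6, 4
prototypes = rng.normal(size=(q, n, 3))          # 3D views X_i
alpha = rng.normal(size=q)                       # weights alpha_i
X0 = np.tensordot(alpha, prototypes, axes=1)     # X_0 = sum_i alpha_i X_i

R = rot_y(30)
lhs = view2d(X0, R)                                          # x_0^r
rhs = sum(a * view2d(X, R) for a, X in zip(alpha, prototypes))  # sum_i alpha_i x_i^r
same_weights = np.allclose(lhs, rhs)
```

The check succeeds for any linear map in place of `R`, since only linearity of the transformation and of the projection is used.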

These relations suggest that we can use “prototypical” 2D views and their known
transformations to synthesize an operator that will transform a 2D view into a new 2D view
when the object is a linear combination of the prototypes. Notice that the decomposition
of equation 5 is always possible if $q > 3n$, but that in general the decomposition cannot be
found uniquely for one 2D view and the given prototypes. However, if $q < 2n$, then it is
possible to recover the coefficients $\alpha_i$. This observation leads to:

Definition of a linear object class 

A set of 3D views (of objects) $\{X_i\}$ is a linear object class if $\dim\{X_i\} < 2n$ with
$X_i \in \Re^{3n}$.

This is equivalent to saying that all objects of the same class cluster in a small linear subspace
of $\Re^{3n}$ spanned by $2n$ prototypes. Edelman (1992) discusses closely related issues in the
context of the complexity of recognition. 


3.2 How to Learn Transformations for Linear Object Classes 

First we compute the coefficients $\alpha_i$ for the optimal decomposition (in the least-squares
sense) of an “initial” view $x_0$ of an object $O$ into the “initial” views $x_i$ of the $q$ given
prototypes by minimizing

$$\left\| x_0 - \sum_{i=1}^{q} \alpha_i x_i \right\|^2. \qquad (7)$$

We rewrite equation 6 as

$$x_0 = \Xi \alpha \qquad (8)$$

where $\Xi$ is the matrix formed by the $q$ vectors $x_i$ arranged column-wise and $\alpha$ is the column
vector of the $\alpha_i$ coefficients. Minimizing equation 7 gives

$$\alpha = \Xi^{+} x_0 \qquad (9)$$

The observation of the previous section implies that the operator that transforms $x_0$ into $x_0^r$
through $x_0^r = L x_0$ is given by

$$x_0^r = \Xi^r \alpha = \Xi^r \Xi^{+} x_0 \qquad (10)$$

as

$$L = \Xi^r \Xi^{+}, \qquad (11)$$

and thus can be learned from the 2D example pairs $(x_i, x_i^r)$. In this case, a one-layer, linear
network (compare Hurlbert and Poggio, 1988) can be used to learn the transformation $L$. $L$
can then transform a view of a novel object of the same class. If the $q$ examples are linearly
independent, $\Xi^{+} = (\Xi^T \Xi)^{-1} \Xi^T$ and the minimization of equation 7 provides
$x_0 = \sum_i \alpha_i x_i$.

Figure 5: Four 2D views (top) of a 3D object rotated three times around a fixed axis, each
time by 5°. From the resulting 3 pairs of 2D views, the transformation “rotation by 5°”
can be learned in terms of the linear operator $L$. The lower row shows the effect of applying
the transformation, iterated 10 times, to the upper right 2D view.
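Equations 8 to 11 translate directly into a pseudoinverse computation. The sketch below is an illustration under assumptions rather than the linear network mentioned in the text: prototypes are random 3D point sets, the specific transformation is the +30° to 0° rotation about the vertical axis used as an example earlier, the novel object is an exact linear combination of the prototypes, and helper names are hypothetical.

```python
import numpy as np

def rot_y(deg):
    """Rotation by `deg` degrees about the vertical y axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def view2d(X, R):
    """Orthographic 2D view (length 2n vector) of the n x 3 point set X rotated by R."""
    return (X @ R.T)[:, :2].ravel()

rng = np.random.default_rng(1)
n, q = 8, 5                                   # q < 2n prototypes
R_init, R_new = rot_y(30), rot_y(0)

prototypes = rng.normal(size=(q, n, 3))
Xi = np.column_stack([view2d(X, R_init) for X in prototypes])    # columns x_i
Xi_r = np.column_stack([view2d(X, R_new) for X in prototypes])   # columns x_i^r

L = Xi_r @ np.linalg.pinv(Xi)                 # equation 11

# Novel object of the same linear class: a combination of the prototypes.
alpha = rng.normal(size=q)
X_novel = np.tensordot(alpha, prototypes, axes=1)

predicted = L @ view2d(X_novel, R_init)       # equation 10
actual = view2d(X_novel, R_new)
max_error = np.max(np.abs(predicted - actual))
```

For objects outside the span of the prototypes the same operator gives only the least-squares approximation implied by equation 7.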

3.3 Examples of Linear Object Classes and the Role of Symmetry

A “small” number of prototypes 

Each set of $2n$ linearly independent objects defines a linear object class, which contains all
their linear combinations. 

The space of a single object 

As recently discovered by Basri and Ullman (1989), the space spanned by all rotations of 
one object has dimension 6. The dimension is reduced to 3 if the rotations are limited to 
the rotation around one fixed axis. A few examples (6 or 3) therefore span the complete 
view space in which any novel view of the same object - obtained through a 3D rotation 
(or any uniform affine transformation in 3D), followed by orthographic projection - lies. It 
is possible to transform this transformed view again, and thus to compute all the 2D views 
generated by a stepwise 3D rotation (see figure 5). 
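The stepwise rotation of a single object can be reproduced with the same pseudoinverse construction. In this sketch (an illustration under assumptions, with hypothetical helper names and a random point set standing in for the object of Figure 5), four views separated by 5° about the vertical axis give three example pairs, which suffice because the views then span only a 3 dimensional space.

```python
import numpy as np

def rot_y(deg):
    """Rotation by `deg` degrees about the vertical y axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def view2d(X, R):
    """Orthographic 2D view (length 2n vector) of the n x 3 point set X rotated by R."""
    return (X @ R.T)[:, :2].ravel()

rng = np.random.default_rng(2)
obj = rng.normal(size=(7, 3))                       # one rigid object, 7 feature points

views = [view2d(obj, rot_y(5 * k)) for k in range(4)]   # 0, 5, 10, 15 degrees
Xi = np.column_stack(views[:3])                     # initial views of the 3 pairs
Xi_r = np.column_stack(views[1:])                   # the same views rotated by 5 degrees
L = Xi_r @ np.linalg.pinv(Xi)                       # learned "rotation by 5 degrees"

v = views[3]
for _ in range(10):                                 # iterate the learned operator
    v = L @ v

expected = view2d(obj, rot_y(15 + 10 * 5))          # true view at 65 degrees
iteration_error = np.max(np.abs(v - expected))
span_dim = np.linalg.matrix_rank(np.column_stack(views))
```

With unrestricted 3D rotations the view space has dimension 6, so six independent example pairs would be needed instead of three.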

Footnote 4: Under more general assumptions, however, such as perspective projection and use of
other non-geometric features instead of or in addition to the x, y coordinates of labeled surface
points, we expect the mapping between “frontal” views and rotated views to be nonlinear.
Techniques such as HyperBF should then be used (Poggio and Girosi, 1990).
Objects with a symmetry structure

3D objects with a common or partly common interior structure (e.g. symmetry, fixed angles
(such as right angles) or fixed ratios between some feature points) may form a linear object
class. The following result holds: A class of objects, each one represented by a special 3D
view $\{X_s\}$, $X_s \in \Re^{3n}$, is a specific linear object class if the structure can be represented by
a matrix $S$ $(3n, 3n)$ with $\mathrm{rank}(S) < 2n$ and $X_s = S X_s$.

Symmetric objects form a natural linear object class of this type. In the case of bilateral
symmetry in the y, z plane $\{X_s\}$ is taken to be a “frontal” 3D view of the object and $S$ can
be written as:

$$S = \begin{pmatrix} S_{pl} & 0 \\ 0 & S_b \end{pmatrix}$$

where $S_{pl}$ $(3p, 3p)$ defines the structure of $p$ points in the symmetry plane and $S_b$ $(2 \times 3b, 2 \times 3b)$
of $b$ pairs of symmetric points. Both $S_{pl}$ and $S_b$ can be written in a diagonal form.

The final dimension $d$ of such a class $\{X_s\}$ is determined by the number $p$ of feature
points in the symmetry plane and the number $b$ of symmetric feature pairs. Feature points
in one symmetry plane reduce the dimension from $3p$ to the upper limit of $2p$. Points and
their symmetric counterparts reduce the dimension from $2 \times (3b)$ to $3b$.
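The dimension count can be verified empirically by sampling frontal 3D views of random bilaterally symmetric objects and measuring the rank of the sample matrix. This is a sketch under assumptions (symmetry plane x = 0, Gaussian random coordinates, hypothetical function names); the upper limit $2p + 3b$ is what the text above implies.

```python
import numpy as np

def class_dimension(p, b):
    """Upper limit on the dimension: 2 per plane point, 3 per symmetric pair."""
    return 2 * p + 3 * b

def sample_frontal_object(p, b, rng):
    """Frontal 3D view (length 3(p + 2b)) of a random object symmetric about x = 0."""
    plane = rng.normal(size=(p, 3))
    plane[:, 0] = 0.0                                  # points in the symmetry plane
    half = rng.normal(size=(b, 3))
    mirrored = half * np.array([-1.0, 1.0, 1.0])       # symmetric counterparts
    return np.concatenate([plane, half, mirrored]).ravel()

rng = np.random.default_rng(3)
p, b = 5, 2                                            # 9 points, 5 on the plane
samples = np.column_stack([sample_frontal_object(p, b, rng) for _ in range(40)])
ambient_dim = samples.shape[0]
empirical_dim = np.linalg.matrix_rank(samples)
predicted_dim = class_dimension(p, b)
```

For 9 feature points with 5 on the symmetry plane the ambient dimension is 27 and both the predicted and the measured class dimension are 16.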


3.3.1 Learning the Transformation Component by Component 

In the previous section we considered learning the appropriate transformation from full views. 
In this case the examples (prototypes) must have the same dimensionality as a full view. 
Our arguments above show that dimensionality determines the number of example pairs 
needed for a correct transformation. This section suggests that components of an object -
i.e. a subset of the full set of features - that are elements of the same object class may be
used to learn a single transformation with a reduced number of examples, because of the
smaller dimensionality of each component. The basic components in which a view can be
decomposed are given by the irreducible submatrices $S_i$ of the structure matrix $S$ so that
$S = S_1 \oplus \cdots \oplus S_k$.

Figure 6: Two examples of symmetric objects. The 3 symmetry planes of a cuboid reduce the
effective dimensions of the object space of all cuboids from 24 to 3. The bilateral symmetry
of objects consisting of 9 feature points with 5 points on the symmetry plane (see right figure)
reduces the dimensions from 27 to 16.

Consider again the linear class of bilaterally symmetric objects. The “diagonal” structure
of $S$ with only two submatrices is preserved after a linear transformation of the feature points
in $\Re^3$:

$$L_{3n} X_s = L_{3n} S X_s$$
$$X_s^r = S^r X_s$$

This shows that the problem of transforming the 2D view $x_s$ of the 3D objects $X_s$ into the
transformed 2D views $x_s^r$ can be treated separately for each component of $X_s$. For simplicity,
we deal in the following only with symmetric pairs of feature points (points on the symmetry
plane are degenerate pairs). The components are determined by the submatrix $s_{b_i}$ on the
diagonal of $S$ and are the 2D coordinates of a pair of bilaterally symmetric points $x_{b_i}$. The
constraint $X_s = S X_s$ leads to:

$$X_{b_i} = s_{b_i} X_{b_i}$$

This equation is equivalent to equation 5. Therefore the linearly independent column vectors
of $s_{b_i}$ span the 3 dimensional space of a pair of symmetric points. It follows that 3 examples
are sufficient to learn a transformation of a pair of bilaterally symmetric points (using a linear
network).
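The three-example claim for a single symmetric pair can be illustrated as follows. This sketch rests on assumptions: the symmetry plane is x = 0, the initial pose is a 30° rotation about the vertical axis (not an exactly frontal view, for which the 2D view of a pair would no longer determine its depth), the target pose is 0°, and the helper names are hypothetical.

```python
import numpy as np

def rot_y(deg):
    """Rotation by `deg` degrees about the vertical y axis."""
    t = np.deg2rad(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pair_view(point, R):
    """2D view (x1, y1, x2, y2) of the symmetric pair generated by a 3D point."""
    pair = np.array([point, point * np.array([-1.0, 1.0, 1.0])])
    return (pair @ R.T)[:, :2].ravel()

rng = np.random.default_rng(4)
R_init, R_new = rot_y(30), rot_y(0)

examples = rng.normal(size=(3, 3))                         # 3 prototype pairs
Xi = np.column_stack([pair_view(pt, R_init) for pt in examples])
Xi_r = np.column_stack([pair_view(pt, R_new) for pt in examples])
L_pair = Xi_r @ np.linalg.pinv(Xi)                         # 4 x 4 operator for one pair

novel = rng.normal(size=3)                                 # pair of a novel object
pair_error = np.max(np.abs(L_pair @ pair_view(novel, R_init)
                           - pair_view(novel, R_new)))
```

The same 4 x 4 operator can then be applied to every symmetric pair of any bilaterally symmetric object, which is why the number of required examples does not grow with the number of feature points.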

This observation shows the dramatic decrease in the number of examples necessary for
learning the specific transformation if a bilaterally symmetric object is transformed by
components. In this case a single basic component consists of a symmetric pair of points. A total
of 3 examples for each pair of points are sufficient to learn a specific transformation (such
as a rotation from $\alpha$ to $\beta$ around a prespecified axis) of any bilaterally symmetric object. As
shown earlier, the lower limit for the number of examples is $\frac{1}{2} \times 3n$ for a symmetric object
consisting of $n$ points if the object is transformed as a whole.
4 Concluding Remarks

• Classifying a novel view in terms of an object class 

We have left open the question of how to classify the object from a novel 2D view. This
is the first step before inferring certain symmetry properties or applying learned
transformations. Notice that hypotheses about symmetries can always be attempted
and tried out.

• Identifying a symmetry pair (or an n-tuple)

The techniques of Part I require identification in the novel view of symmetry pairs (or
quadrupoles). Additional information may be available (e.g. once the two eyes are
identified as eyes, it is known that they represent a symmetric pair). In other cases
(e.g. line drawings of geometric objects) algorithms capable of identifying feature
points likely to be symmetric should be feasible, though we have not worked on this
problem yet. It is intriguing to speculate about relations to known human abilities of
detecting symmetries and to human tendencies of hypothesizing symmetry in visual
perception.

• Exact frontal model views should be avoided 

Our results about bilateral symmetry imply that one should avoid using in the database
a model view which is a fixed point of the symmetry transformations (since the
transformation of it generates an identical new view). In the case of faces, this implies
that the model view in the database should not be an exactly frontal view.

• A symmetry of higher order than bilateral allows recovery of structure from one 2D 
view 

Our results imply that even when other cues that provide structure from 1 view (such 
as shading, perspective, texture etc.) are absent, an object symmetry of sufficiently 
high order may provide structure from a single view. An interesting conjecture is that 
human perception may be biased to impose a symmetry assumption (in the absence of 
other evidence to the contrary), in order to compute structure. 

• A new algorithm for computing structure from single views of polyhedral objects

Marrill (1991) proposed an iterative algorithm that is capable of recovering structure
from single views of some simple geometric solids. Sinha (1992) has considerably improved
the algorithm and shown that it works well on a wide range of line drawings.
Our result on structure-from-1-view may explain some of these results in terms of the
underlying algebraic structure induced by symmetry properties (or other properties,
for instance constraints on angles). It also yields a new non-iterative algorithm for
the recovery of structure since it provides (once symmetric n-tuples are identified) a simple
algorithm generating a total of 3 linearly independent views to which any of the
classical structure-from-motion algorithms can be applied, including the recent linear ones
(Huang and Lee, 1989). It remains an open question to characterize the connection between the
minimization principle of Marrill-Sinha and our internal structure constraints.
• A practical algorithm for face recognition, based on features

Assume that one almost-frontal image per person is available in the data base. The matrix B
is synthesized for each person by identifying a set of symmetric pairs (eyes, etc.) and
performing the operations described earlier on the model view. When a novel view is
presented:

1. Assume or infer that the image represents a face.

2. Identify pairs of symmetric points, such as the eyes.

3. Apply the operator $B(B^T B)^{-1} B^T$ to the vector associated with the novel view to
verify recognition.
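
The verification in step 3 can be sketched as follows. This is only an illustration under stated assumptions: the columns of B are taken to be linearly independent model views, the novel view is accepted when its distance from its projection onto the span of B is (near) zero, and the helper names `solve`, `project` and `residual` as well as the toy vectors are hypothetical. Exact rational arithmetic is used so that "zero residual" is unambiguous; with noisy data a threshold would be needed.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A z = b by Gauss-Jordan elimination (exact).
    Assumes A is nonsingular; a singular A makes the pivot search fail."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def project(B_cols, x):
    """Return B (B^T B)^{-1} B^T x, where B_cols lists the columns of B."""
    gram = [[sum(a * b for a, b in zip(u, v)) for v in B_cols] for u in B_cols]
    rhs = [sum(a * b for a, b in zip(u, x)) for u in B_cols]
    coeffs = solve(gram, rhs)
    return [sum(c * col[i] for c, col in zip(coeffs, B_cols)) for i in range(len(x))]

def residual(B_cols, x):
    """Squared distance between x and its projection onto span(B)."""
    p = project(B_cols, x)
    return sum((Fraction(a) - b) ** 2 for a, b in zip(x, p))

# Toy example (hypothetical numbers): two model views as columns of B.
B = [[1, 0, 0, 1], [0, 1, 1, 0]]
same_object = [2, 3, 3, 2]    # 2 * first column + 3 * second column
other_object = [1, 0, 0, 0]   # not in the span of B

res_same = residual(B, same_object)
res_other = residual(B, other_object)
```

A view of the modelled object leaves no residual, while a vector outside the span of B leaves a strictly positive one.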

• An even more practical algorithm for face recognition, based on “grey”-levels 

Assume that one almost-frontal quasi-grey-level image per person is available in the data base
(Brunelli and Poggio, 1991). Assume that symmetric pairs are identified in the data
base image. Assume further that four points can be found (such as one eye, the corners
of the mouth, and the top of the nose) and matched between the novel view and the
model view. Then all other points (assuming that faces are sufficiently symmetric!) can
be matched (disregarding self-occlusions) and a distance (or correlation) measure can
be computed. This technique assumes quasi-constant illumination and is not invariant
to expression. It is invariant to scaling and pose (modulo self-occlusions). We are
presently working towards testing and extending this basic technique. It may lead to
practical applications in model acquisition and 3D object recognition, since it makes
it possible to combine features and grey levels in an elegant and efficient way.

• From views to grey-level images 

The obvious way to go from views (see Appendix for definition) to grey-level images is 
through texture mapping (Poggio and Brunelli, 1990). 

• Other definitions and uses of prototypes 

S. Ullman has suggested that it may be wiser to define, instead of the several prototypes
of equation 5, one single prototype and a small set of “perturbation” vectors.
This is formally completely equivalent to the formulation of section 3.1, but it may
better capture the psychophysics of object recognition.

We should also mention, though this is somewhat outside the scope of this paper,
that it is possible to use prototypes - say of a face - to compute parameters, such as 
illumination and pose, that may help to “normalize” a later recognition step. Poggio 
and Edelman (1990) used a HyperBF network to learn to associate to a 2D view of a 
specific object the correct 3D pose parameters. It seems that a reasonable performance 
may be achieved for similar tasks by using appropriate prototype(s) of the specific 
object (R. Basri also suggested a similar idea). 

• Nonlinear object classes and nonlinear transformations 

The basic idea of Part II, to learn appropriate transformations from instances of the
same object class, can be applied to object classes other than the linear classes we
have defined and characterized. In addition, transformations to be learned may be
nonlinear or non-uniform (we have only considered linear, uniform transformations on
2D views): an example is the transformation that changes the expression of a face from
serious to smiling or the transformation that “ages” a face. Nonlinear object classes
and nonuniform, nonlinear transformations require learning techniques more powerful
than the linear ones we have considered in Part II. Approximation networks such as
HyperBF (Poggio and Girosi, 1990) may be needed.

• An alternative to elastic templates 

Elastic templates have been used for at least twenty years to perform recognition when
only one template (or very few) is available. Elastic templates are equivalent to
using complex metrics (i.e. cost functionals) that take into account prior knowledge
about allowed deformations and penalize them accordingly. Though there are techniques,
such as HyperBF (Poggio and Girosi, 1990), that can learn, to some extent,
the appropriate metric (through the matrix W) from examples, in general the art of
generating good elastic templates is “black magic”. In addition, elastic templates are
usually very expensive computationally at run-time (because of the usually non-convex
minimization problem). A more classical and formally more satisfying approach is to
have a fixed metric (or almost fixed: certain invariances such as translation for which
the cost is zero, if valid for the specific problem, should be embedded in the cost
functional or in the choice of the input features from the very beginning) and to provide a
sufficient number of examples of allowed and not allowed deformations. One could then
use classification or approximation techniques such as HyperBF. In some cases, however,
only very few examples of deformations (or none) are readily available. The idea is
then to generate artificial examples of deformations for the specific object of interest
by learning the allowed deformations from a set of examples of objects of the same
class, using standard approximation techniques.

A The 1.5 view theorem and other useful background math

A.1 Summary of the Appendix

This appendix⁵ introduces definitions and results that characterize the algebraic structure
of the views of one 3D object under orthographic projection. Consider the linear vector
space $\Re^{3N}$ of 3D views of all objects, with a 3D view being the vector of the x, y and z
coordinates of each of N feature points. Consider the subspace $V_{ob}^{3N}$ generated by one view
of a specific object and by the action on it of the group of uniform linear transformations
$\mathcal{L}$ (i.e. the same linear transformation is applied to each feature point). $\mathcal{L}$ is an algebra
of order 9, and therefore a linear vector space isomorphic to $M_3$ (that is, the space of the
3x3 matrices with real elements). Thus, $V_{ob}^{3N}$ is a linear vector space isomorphic to $\Re^9$.
The projection operator (orthographic projection) that deletes the z components from the
3D views maps $V_{ob}^{3N}$ into a linear vector subspace $V_{ob}^{2N}$, isomorphic to $\Re^6$. $V_{ob}^{2N}$ consists of
vectors with x and y components and can be written as the direct sum $V_{ob}^{2N} = V_x^N \oplus V_y^N$,
where $V_x^N$ and $V_y^N$ are non-intersecting linear subspaces, each isomorphic to $\Re^3$. In addition,
Poggio (1990) has proved (Basri obtained this result independently, see Ullman and Basri,
1991) that $V_x^N = V_y^N$, which implies that 1.5 snapshots are sufficient for “learning” an object
(generically) and performing recognition of a novel view. If 3D translations are included,
a linear subspace isomorphic to $\Re^2$ must be added to the linear space spanned by the 2D
views of one object. The 1.5 views theorem implies that the x and the y vectors obtained
from the 2 frames are linearly dependent. This in turn implies that 4 matched points across
two views are sufficient (generically) to determine 1-D epipolar lines for matching all other
points. This is a useful result (first obtained in a different context by Huang and Lee, 1989,
see also Basri, 1991 and Shashua, 1991) in correspondence problems involving 2 frames and
affine, uniform transformations in 3D.

A.2 Introduction 

Basri and Ullman (1989) have recently discovered the striking fact that under orthographic 
projection a view of a 3D object is the linear combination of a small number of views of 
the same object. In this note, we reformulate their results in the more abstract setting of 
linear algebra. This framework makes the result very transparent: the constraint of uniform 
linear transformation (the same linear transformation for each vertex) implies immediately 
that the set of views of an object spans a 9-dimensional linear vector space, independently 
of the number of vertices; orthographic projection preserves linearity while reducing the 
number of dimensions to 6. Simple considerations show that the linear spaces of the x and y 
coordinates are nonintersecting and that each has dimension 3. Furthermore it can be proved 
(Poggio, 1990) that they are equivalent, implying that 1.5 snapshots are sufficient to learn 
the model of one object. We do not consider here the additional constraint of restricting 
the uniform affine transformation to be rigid, i.e. to be a rotation. Rotations generate a
nonlinear subspace of $V_{ob}^{3D}$. It is easy to test for rigidity; it is more difficult to understand
the (nonlinear) algebraic structure (see last section in Poggio, 1990).

⁵ The content of this appendix is from Poggio, 1990 (IRST Technical Report 9005-03, 1990)


A.3 Any view of a 3D object is a linear combination of a small,
fixed number of views

This section provides the main result of Basri and Ullman (in the second subsection). 

A.3.1 Any 3D-view of an object is a linear combination of 9 views 
Let us define a 3D-view of a specific 3D object as:

$$X = (x_1, y_1, z_1, x_2, y_2, z_2, \ldots, x_n, y_n, z_n)^T$$

with $X \in \Re^{3n}$, which is a vector space in the usual way.

We consider the set of uniform (my definition) linear operators on $\Re^{3n}$, defined by the
$3n \times 3n$ matrices $L^{3n}$, where $L^{3n} = I_n \otimes L$ is the tensor product of $I_n$ and $L$:

$$L^{3n} = \begin{pmatrix} L & 0 & \cdots & 0 \\ 0 & L & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & L \end{pmatrix}$$

where

$$L = \begin{pmatrix} l_{11} & l_{12} & l_{13} \\ l_{21} & l_{22} & l_{23} \\ l_{31} & l_{32} & l_{33} \end{pmatrix}$$

is an affine transformation on $\Re^3$. Translation in 3D space is taken care of separately (see
later).

The space of the $L^{3n}$ operators is a vector space which is isomorphic to the vector space
of the $L$ matrices. It therefore has a basis of 9 elements independently of $n$.

We can express

$$L^{3n} = \sum_{i=1}^{9} a_i L_i^{3n}$$

where the $a_i$ can be identified with the appropriate $l_{i,j}$ and the $L_i^{3n}$ with the usual basis for $L^{3n}$, i.e.
with the elementary matrices $E_{i,j}$, and thus

$$X = L^{3n} X_0 = \sum_{i=1}^{9} a_i L_i^{3n} X_0 = \sum_{i=1}^{9} a_i X_i$$

where the $X_i$ are 9 independent 3D views of the specific object, needed to span the 9 elements
of $L$, 3 for each coordinate, and $X_0$ is a particular view chosen as the “initial” view. Thus:

Theorem A.1 The vector space $V_{ob}^{3D}$ generated by the action of uniform linear transformations
on a 3D view of a specific object is a 9-dimensional subspace of $\Re^{3n}$, 3 dimensions for
x, 3 for y and 3 for z.

Thus any object $ob_i$ corresponds to a low dimensional subspace $V_{ob_i}^{3D}$ of the space of all
possible views of all objects $\Re^{3n}$. Of course, $V_{ob_i}^{3D} \neq \Re^{3n}$ iff $n > 3$. In other words, to have
object specificity, i.e., for this result to be nontrivial, it is necessary that $n > 3$ (translations
are supposed to be factored out by using an extra pair). Notice that 9ft 3n = V ob . + +_ 

A.3.2 Any 2D-view of a 3D object is a linear combination of 6 2D-views 
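Now consider the orthographic projection $P : \Re^{3n} \to \Re^{2n}$, defined by $PX = x$, that is

$$x = (x_1, y_1, x_2, y_2, \ldots, x_n, y_n)^T$$

with $P$ being a linear operator with the matrix representation

$$P = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & \cdots & 0 & 0 & 0 \\
\vdots & & & & & & \ddots & & & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 & 1 & 0
\end{pmatrix}$$

We define $x$ as the 2D-view of a 3D object.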

The result below follows immediately (6 views span the elements of $L$ in the first 2 rows)
and is the main result of Basri and Ullman (in a different formulation):

Theorem A.2 The vector space $V_{ob}^{2D}$ given by $V_{ob}^{2D} = P V_{ob}^{3D}$ is a six-dimensional subspace of
$\Re^{2n}$ (the space of all 2D orthographic views of all 3D objects), i.e. $x \in V_{ob}^{2D} = P V_{ob}^{3D}$.
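
The dimension counts of Theorems A.1 and A.2 can be checked numerically on a small example. The sketch below is only illustrative: the five feature points, the helper names (`rank`, `transform`, `view3d`, `view2d`) and the extra test matrix are assumptions made for the example, not part of the original report. It applies the 9 elementary matrices to one object, stacks the resulting views, and computes their rank with exact arithmetic.

```python
from fractions import Fraction
from itertools import product

def rank(rows):
    """Rank of a list of row vectors, by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(M[0])):
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def transform(L, points):
    """Apply the same 3x3 matrix L to every feature point (uniform transformation)."""
    return [tuple(sum(L[i][k] * p[k] for k in range(3)) for i in range(3)) for p in points]

def view3d(points):
    """Stack (x, y, z) of all points into one vector of length 3n."""
    return [c for p in points for c in p]

def view2d(points):
    """Orthographic projection: keep (x, y) of each point, drop z."""
    return [c for p in points for c in p[:2]]

# Hypothetical object with n = 5 > 3 feature points in general position.
obj = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 2, 3)]

# The 9 elementary matrices form a basis of the 3x3 matrices.
elementary = []
for i, j in product(range(3), repeat=2):
    E = [[0] * 3 for _ in range(3)]
    E[i][j] = 1
    elementary.append(E)

views3d = [view3d(transform(E, obj)) for E in elementary]
views2d = [view2d(transform(E, obj)) for E in elementary]
rank3d = rank(views3d)   # dimension of the space of 3D views
rank2d = rank(views2d)   # dimension of the space of 2D views

# A further view under an arbitrary uniform map should not enlarge the 2D space.
extra = view2d(transform([[2, 1, 0], [0, 3, 1], [1, 1, 1]], obj))
rank2d_with_extra = rank(views2d + [extra])
```

For this object the 3D views have rank 9 and the 2D views rank 6, and the additional view lies in the same six-dimensional space.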

The inclusion of rigid translations is equivalent to the addition of a two-dimensional linear
subspace (the same for all objects), spanned by the vectors

$$(1, 0, 1, 0, \ldots, 1, 0)^T \quad \text{and} \quad (0, 1, 0, 1, \ldots, 0, 1)^T$$



A.4 The x and the y coordinates of a view are each a separate 
linear combination of 3 views 

In the previous section we have seen that any 2D-view of a 3D object under orthographic
projection is the linear combination of 6 2D-views. This section reformulates another
observation of Ullman and Basri: the x coordinates of a 2D-view are a linear combination of
the x coordinates of 3 2D-views and the y coordinates are a linear combination of the y
coordinates of 3 2D-views, the two combinations being independent of each other.

Let us consider a similarity transformation of $X$:

$$X' = (x_1, \ldots, x_n, \; y_1, \ldots, y_n, \; z_1, \ldots, z_n)^T$$

Under this similarity transformation, $L^{3n}$ becomes a $3 \times 3$ matrix of 9 (that is, $3 \times 3$) blocks.
Each block is a multiple of $I \in \Re^{n,n}$ (notice the “isomorphism” to $L$):

$$\begin{pmatrix} L_{11} & L_{12} & L_{13} \\ L_{21} & L_{22} & L_{23} \\ L_{31} & L_{32} & L_{33} \end{pmatrix}$$

where

$$L_{11} = l_{11} I$$

and so on for the other blocks.

The same argument of section A.3 makes it clear that defining

$$\xi = (x_1, \ldots, x_n)^T, \qquad \eta = (y_1, \ldots, y_n)^T$$

the following holds:

$$\xi = \sum_{i=1}^{3} a_i \xi_i, \qquad \eta = \sum_{i=1}^{3} b_i \eta_i,$$

that is, 

Theorem A.3 The subspace spanned by the vectors $\xi$ (the x components of $x$), which
is an n-dimensional subspace of $\Re^{2n}$ (which is 2n-dimensional), is spanned by three views
of the x coordinates of the object undergoing uniform transformations, i.e., each $\xi$ can be
represented as the linear combination of 3 independent $\xi_i$. The same is true for the $\eta$: each
$\eta$ is an independent linear combination of 3 independent $\eta_i$. Again, $n > 3$ in order for this
to be non-trivial (since the $\xi$ span $\Re^n$ for $n \le 3$), once translations are factored out.

Remark: The basis of $\xi$ and the basis of $\eta$ depend on the specific object.

A.5 $V_x$ and $V_y$ have the same basis, i.e. 1.5 snapshots suffice

We know from the previous sections that $V_{ob}^{2N} = V_x^N \oplus V_y^N$, where $\dim V_x = \dim V_y = 3$. A
stronger property holds:

Theorem A.4 (The 1.5 view theorem) $V_x = V_y$

Proof. Assume that $V_x$ and $V_y$ are not identical (I consider the projections of the x and y
components expressed originally in the same base in $V$). Then there is a vector $y$ which is in
$V_y$ and not in $V_x$ (or vice versa). Then we can take the 3D view that originated $y$ (through
orthogonal projection) and apply to it a legal transformation consisting of a rigid rotation
of 90 degrees in the image plane (such a transformation is in $L$ and therefore is legal). The
x view of that 3D vector is the $y$, contradicting the assumption. It follows that $V_x = V_y$.
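
The rotation invoked in the proof can be written out explicitly. The following is a sketch under the assumption that the image plane is the (x, y) plane and that the rotation is counterclockwise; the symbol $R$ and the sign convention are ours, not the original report's.

```latex
R = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
R \begin{pmatrix} x_i \\ y_i \\ z_i \end{pmatrix}
  = \begin{pmatrix} -y_i \\ x_i \\ z_i \end{pmatrix},
\qquad i = 1, \ldots, n .
```

With this convention the x components of the rotated view are $-y_i$, so the vector of y components of the original view (up to sign) is itself an x view of the same object. Since $V_x$ is a linear subspace, the sign does not matter, and the vector belongs to $V_x$.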

Remarks 

1. The same argument shows that $V_x = V_y = V_z$.

2. The same basis of three vectors spans $V_x$, $V_y$ and $V_z$ (separately).

3. The property that the x views and the y views of the same 3D object from the same 
snapshot are independent is generic, since if they were dependent, a very slightly 
different object, differing only in the y coordinate of one vertex would have independent 
views (Bruno Caprile, pers. com.). 

4. In general, 1.5 snapshots are sufficient to provide a basis (with n > 3, once translations 
are factored out, in order for this to be nontrivial). 

5. Any 4 vectors from $V_x$ and $V_y$ are linearly dependent.
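
Remarks 4 and 5 can be illustrated on a toy object. The sketch below is an assumption-laden example rather than the procedure of the original report: the object, the three transformation matrices and the helper names (`rank`, `components`) are hypothetical. It takes the x and y vectors of a first view and the x vector of a second view ("1.5 snapshots") and checks that the x and y vectors of a third, novel view add no new dimension.

```python
from fractions import Fraction

def rank(rows):
    """Rank of a list of row vectors, by exact Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    r = 0
    for c in range(len(M[0])):
        p = next((i for i in range(r, len(M)) if M[i][c] != 0), None)
        if p is None:
            continue  # no pivot in this column
        M[r], M[p] = M[p], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == len(M):
            break
    return r

def components(L, points):
    """x and y component vectors of the 2D view of `points` under the 3x3 matrix L."""
    xs = [sum(L[0][k] * p[k] for k in range(3)) for p in points]
    ys = [sum(L[1][k] * p[k] for k in range(3)) for p in points]
    return xs, ys

# Hypothetical object with n = 5 > 3 feature points.
obj = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 1), (1, 2, 3)]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_y_90 = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]  # rigid rotation about the y axis
novel = [[2, 1, 0], [0, 3, 1], [1, 1, 1]]      # arbitrary uniform linear map

x1, y1 = components(identity, obj)
x2, y2 = components(rot_y_90, obj)
x3, y3 = components(novel, obj)

basis = [x1, y1, x2]                       # "1.5 snapshots"
rank_basis = rank(basis)
rank_with_novel = rank(basis + [x3, y3])   # novel view adds no dimension
rank_two_views = rank([x1, y1, x2, y2])    # 4 vectors from 2 views
```

Here all three ranks equal 3: the 1.5 snapshots already span the space, and the four vectors obtained from two full views are linearly dependent.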

A.6 A corollary of the 1.5 views theorem: given four matched
points, correspondence for motion or recognition is easy

A direct consequence of the above 1.5 views theorem is that the 4 vectors (from 2 orthographic
views) of the x and y components of an object undergoing a uniform affine transformation
in 3D (in particular a rigid transformation in 3D) are linearly dependent, that is

$$\alpha_1 \mathbf{x}_1 + \beta_1 \mathbf{y}_1 + \alpha_2 \mathbf{x}_2 + \beta_2 \mathbf{y}_2 = 0.$$

This implies that the correspondence of at least 4 non-coplanar points (including translations)
in two frames determines epipolar lines for the matching of all other points (the observation
is due to Ronen Basri, 1991; see also Amnon Sha’shua, 1991; a similar result, but not this
proof, was first obtained by Lee and Huang, 1988). This means that for each point $(x_1, y_1)$
in frame 1, the corresponding point $(x, y)$ in frame 2 satisfies the equation

$$y = m x + A$$

with $m = -\alpha'_2$ and $A = -(\alpha'_1 x_1 + \beta'_1 y_1)$, where $\alpha'_1 = \alpha_1/\beta_2$ and so on. Translations are taken
care of by matching one point (the origin of the coordinate systems) in the two frames. Three
additional “generic” points are needed to solve for $\alpha'_1$, $\beta'_1$ and $\alpha'_2$.

Therefore in problems of matching between 2 frames, in motion or recognition, four
non-coplanar points are sufficient to determine epipolar lines along which the matching of
the other points can be more easily found.
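
The construction can be sketched in a few lines. This is a minimal illustration under stated assumptions: one matched point has already been used as the common origin of both frames, the three remaining matched points are generic (so the 3x3 system is nonsingular), and the function names, the toy object and the transformation `L2` are hypothetical. The coefficients solved for are the normalized ones, i.e. the dependency is written as a1*x1 + b1*y1 + a2*x2 + y2 = 0.

```python
from fractions import Fraction

def solve(A, b):
    """Solve the square system A z = b by Gauss-Jordan elimination (exact).
    Assumes A is nonsingular, i.e. the matched points are generic."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(b[i])] for i, row in enumerate(A)]
    for c in range(n):
        p = next(r for r in range(c, n) if M[r][c] != 0)  # pivot row
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def epipolar_coefficients(frame1, frame2):
    """Solve a1*x1 + b1*y1 + a2*x2 + y2 = 0 from three matched points.
    Both frames hold (x, y) pairs relative to a fourth matched point (the origin)."""
    A = [[x1, y1, x2] for (x1, y1), (x2, _) in zip(frame1, frame2)]
    rhs = [-y2 for (_, y2) in frame2]
    return solve(A, rhs)

def epipolar_line(coeffs, point1):
    """Slope m and intercept A of the line y = m*x + A in frame 2 for a frame-1 point."""
    a1, b1, a2 = coeffs
    x1, y1 = point1
    return -a2, -(a1 * x1 + b1 * y1)

def project(L, p):
    """Orthographic 2D view (x, y) of the 3D point p after the uniform linear map L."""
    return tuple(sum(L[i][k] * p[k] for k in range(3)) for i in range(2))

# Toy data: three matched points (plus the origin) and one point to be matched.
L1 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
L2 = [[1, 0, 1], [0, 1, 1], [0, 0, 1]]   # hypothetical uniform transformation
matched = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
unmatched = (1, 2, 3)

coeffs = epipolar_coefficients([project(L1, p) for p in matched],
                               [project(L2, p) for p in matched])
m, A = epipolar_line(coeffs, project(L1, unmatched))
x2_true, y2_true = project(L2, unmatched)
on_line = (y2_true == m * x2_true + A)
```

The true correspondent of the unmatched point falls on the predicted line, so the search for its match in frame 2 reduces to a one-dimensional search along that line.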